Abstract

Demographic and behavioral characteristics of journal authors are important indicators of
homophily in co-authorship networks. In the presence of correlations between adjacent nodes
(assortative mixing), combining the estimation of the individual characteristics and the network
structure results in a well-fitting model, which is capable of providing a deep understanding of
the linkage between individual and social properties. This paper proposes a novel probabilistic
model for the joint distribution of nodal properties (authors' demographic and behavioral
characteristics) and network structure (co-authorship connections), based on the nodal similarity
effect. A Bayesian approach is used to estimate the model parameters, providing insights about
the probabilistic properties of the observed data set. After a detailed analysis of the proposed
statistical methodology, we illustrate our approach with an empirical analysis of the co-authorship
of 1,007 journal articles indexed in the ISI Web of Science database in the field of neuroscience
between 2009 and 2013.

Key words: Bibliometrics, Social networks, Co-authorship networks, Nodal similarities, Bayesian
inference, MCMC.

1 Introduction 

The increasing specialization of scientific research, the interdisciplinary character of most projects,
and the increased funding of cross-institution initiatives have led scientists to take part in collaboration
networks (Teixeira da Silva 2011, Haeussler and Sauermann 2013). Co-authorship networks are
a widely studied class of collaboration networks. They have been analyzed with different
statistical approaches (Newman 2003, 2004), with the purpose of identifying the structure of scientific
partnerships and the role played by individual researchers.
The majority of methodological contributions focus on modelling the structure of scientific
co-authorship, based on the projection of a two-mode network (author-paper network) into a
one-mode structure of co-authorship (author-author network), where links represent co-authors, i.e.,
authors sharing common papers, as described by Leydesdorff and Wagner (2008). We use a similar
approach here. Our main contribution is to combine individual and social properties of journal
authors, by internalizing the effect of authors' similarities in their patterns of connection. This provides
insight into the level of homophily in co-authorship networks, in terms of specific socio-demographic
characteristics (Newman 2003), while accounting for relevant network features based on observed
nodal properties.

We denote by $\mathcal{V}$ the set of $N$ authors and by $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ their known structure of
connections; $\mathcal{K}$ denotes a set of $K$ categorical properties (in our application, $\mathcal{K} = \{\text{genders}, \text{nationalities}\}$)
defined for each author in $\mathcal{V}$. The nodal similarities are assumed to reflect the overlap of authors'
categorical statuses, with respect to the properties in $\mathcal{K}$.

An exponential random model is proposed to internalize the effect of nodal similarities on the
joint distribution of network and authors' properties (Caimo and Friel 2011, Robins et al. 2007).
Exponential families possess good properties that typically simplify the statistical inference of model
parameters. As we explain in Section 3, the inclusion of nodal similarities as sufficient statistics for the
exponential random model makes a complete characterization of the probability distribution
impossible, due to the intractability of the normalizing constant. This is one of the strongest
barriers to the numerical optimization of the likelihood function and motivates the use of
simulation-based approaches, such as the Monte Carlo maximum likelihood of Geyer and Thompson (1992)
and the pseudo-likelihood estimation of Strauss and Ikeda (1990).

As suggested by Caimo and Friel (2011), this drawback can be overcome by embedding the defined
model into a Bayesian estimation framework, which reformulates the estimation problem in terms of the
ability to simulate from the posterior distribution. We build on Murray et al. (2006), who proposed
an MCMC method to simulate from this class of distributions, allowing a flexible estimation of the
effect of nodal similarity, which is the main scope of this paper. Our approach is able to generate
the following insights:

• we can estimate authors' collaborations based on their demographic and behavioral characteristics;

• we can estimate authors' demographic and behavioral characteristics based on their pattern of connections.

In other words, our model connects nodal properties with network structure, so one can be used to 
predict the other. We illustrate our method through the analysis of co-authorship of over a thousand 
journal articles between 2009 and 2013 in the neuroscience research community. 

Section 2 introduces and describes the co-authorship data set, along with the relevant network
statistics we aim to control in a probabilistic model. Section 3 provides a detailed description of the
proposed exponential random model for this type of data set and embeds this model in a general
Bayesian framework. Section 4 addresses the algorithmic aspects of the estimation of the
model parameters. The numerical results are presented in Section 5. Section 6 concludes.
2 Presenting the co-authorship data set


The data set is composed of the scientific publications indexed in the Web of Science (WOS) database
between 2009 and 2013 in the field of Neuroscience. In the first step, 153,182 research papers were
retrieved. Then, we conducted stratified random sampling. The sample size was determined with a
3% sampling error and a 95% confidence level. Table 1 shows the total number of publications
and the stratified sample size per year in the studied field.


Year     No. of publications (%)    Stratified sample size
2009     28,819 (18.81%)            199
2010     30,154 (19.69%)            208
2011     31,030 (20.26%)            214
2012     31,265 (20.41%)            218
2013     31,914 (20.83%)            221
Total    153,182                    1,060


Table 1: The total number of publications and the stratified sample size, 2009-2013.
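
The text does not state which formula was used to determine the sample size. As a hedged illustration only, the sketch below applies Cochran's formula with finite-population correction, assuming maximum variability (p = 0.5) and z = 1.96 for the 95% confidence level; the function name and these defaults are assumptions, not taken from the paper. Under these assumptions the result agrees with the total in Table 1.

```python
import math

def cochran_sample_size(population, margin=0.03, z=1.96, p=0.5):
    """Cochran's sample size with finite-population correction (assumed formula)."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)        # finite-population correction
    return math.ceil(n)

total_papers = 153_182                          # papers retrieved from WOS
sample_size = cochran_sample_size(total_papers)
print(sample_size)                              # 1060
```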



[Figure panels: (a) first largest component; (b) second largest component; (c) third largest component.]


Figure 1: The three largest components with nodal genders (blue for men, pink for women).

Figure 2: The three largest components with nodal nationalities.

Finally, after eliminating those papers whose authors' gender was unclear (53 of them, 5%), our data set
comprised 1,007 (95%) of the 1,060 papers. These 1,007 papers were used as our data set for further
analysis, corresponding to 5,385 authors. Each author was assigned a nationality based on the
author's country of affiliation. Thus, for each author, two demographic characteristics were collected:
gender and nationality.

A network structure of scientific collaboration between authors was then generated by connecting
those authors whose names jointly appear in one or more of the 1,007 articles. The resulting network
comprised 207 disconnected components, the three largest of which had sizes 55, 53 and 35, respectively,
as shown in Figures 1 and 2 together with the corresponding nodal genders and nationalities.
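
The construction just described (connect authors who share a paper, then split the network into components) can be sketched as follows. This is a minimal illustration in plain Python; the function names and the toy author identifiers are hypothetical and do not come from the data set.

```python
from itertools import combinations

def coauthorship_edges(papers):
    """Project per-paper author lists onto undirected author-author edges."""
    edges = set()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges

def connected_components(nodes, edges):
    """Return the connected components as a list of sets, largest first."""
    neighbours = {v: set() for v in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # depth-first traversal
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(neighbours[v] - comp)
        seen |= comp
        components.append(comp)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy input: three papers with made-up author identifiers.
papers = [["A", "B", "C"], ["C", "D"], ["E", "F"]]
authors = sorted({a for p in papers for a in p})
edges = coauthorship_edges(papers)
components = connected_components(authors, edges)
print([len(c) for c in components])
```

On real data the same two steps would be applied to the author lists of all sampled articles, after author names have been disambiguated.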

3 Model definition and specification 

This section proposes an exponential random model which internalizes the structure of dependencies
between individual characteristics and network structure. Consider a categorical variable $Y_k$ with $m_k$
categories defined on a set of $N$ individuals, and its representation in terms of an $N \times m_k$ binary matrix;
let $\mathcal{Y}_k \subseteq \{0,1\}^{N \times m_k}$ be the set of possible realizations of $Y_k$. Similarly, let $Z$ be the adjacency
matrix of a random network with $N$ nodes and $\mathcal{Z} \subseteq \{0,1\}^{N \times N}$ the set of its possible realizations.

The sample space under consideration can be defined as $\mathcal{X} = \mathcal{Z} \times \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_K$,
the set of network structures among $N$ individuals taking $K$ categorical properties, as illustrated
in Figure 3.


Figure 3: Sample space.


Let $X$ be a random matrix taking values in $\mathcal{X}$ and $x$ a possible realization. In the exponential
family of distributions the conditional probability of $x \in \mathcal{X}$ takes the following form:
$P(x \mid \theta) \propto \exp(T(x)^\top \theta)$, where $\theta$ is a vector of natural parameters of the distribution, which can
usually take any value in the reals, and $T(x)$ is a vector of sufficient statistics.

The exponential random model developed here internalizes the nodal similarity effect by including
the matching in the categorical properties of each pair of nodes as a sufficient statistic of the
exponential distribution:


$$
P(x \mid \alpha, \beta, \gamma) \propto
\begin{cases}
\exp\left[\alpha^\top A(y) + \beta^\top B(z) + \gamma^\top G(y,z)\right] & \text{if } x \in \mathcal{X} \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
$$

where $y$ is a matrix whose $(h_k, r)$ component represents the dummy indicator of whether individual
$r$ has value $h_k$ in the $k$-th categorical variable, for $h_k = 1, \dots, m_k$, $r \in \mathcal{V}$; $z$ is the adjacency matrix
whose $(r, s)$ component represents the binary indicator of the existence of a connection between $r \in \mathcal{V}$
and $s \in \mathcal{V}$. $A(y)$ is a vector of sufficient statistics which accounts only for combinatorial properties of
the categorical variables $y$ (such as the number of nodes per level of each categorical variable,
the number of associated categories, etc.), and $\alpha$ is the corresponding parameter vector. Similarly,
$B(z)$ is a vector of sufficient statistics which accounts for combinatorial properties of the network
structure $z$ (such as the clustering coefficient, the assortativity coefficient, the average path length,
etc.), independently of exogenous nodal properties, and $\beta$ is the corresponding parameter vector.
The interaction between nodal characteristics and connection variables is internalized into the model
by the sufficient statistics $G(y,z)$; $\gamma$ is the parameter vector associated with these interactions.

The specification of the sample space $\mathcal{X}$ can incorporate both network and nodal properties, in
accordance with our modeling assumptions and our need to control specified combinatorial properties
(Castro and Nasini 2015). In other words, $P(x \mid \alpha, \beta, \gamma) = 0$ if $x$ does not satisfy a set of feasibility
constraints. As an illustration, three possible sample spaces are shown in (3), obtained by exogenously fixing the
degree sequence, the number of edges and the size of each categorical level. They are specified in
terms of the solution sets of systems of linear constraints. Note that intersections of these sets give
rise to hybrid sample spaces with complex combinatorial structures.

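$$
\begin{array}{ll}
\text{fixed degree sequence:} &
\sum_{h_k=1}^{m_k} y_{h_k r} = 1,\ \ k = 1, \dots, K,\ r = 1, \dots, n; \qquad
\sum_{r=1}^{n} z_{rs} = d_s,\ \ s \in \mathcal{V} \\[6pt]
\text{fixed number of edges:} &
\sum_{h_k=1}^{m_k} y_{h_k r} = 1,\ \ k = 1, \dots, K,\ r = 1, \dots, n; \qquad
\sum_{(r,s) \in \mathcal{V} \times \mathcal{V}} z_{rs} = d \\[6pt]
\text{fixed categorical levels:} &
\sum_{h_k=1}^{m_k} y_{h_k r} = 1,\ \ k = 1, \dots, K,\ r = 1, \dots, n; \qquad
\sum_{r=1}^{n} y_{h_k r} = f_{h_k},\ \ k = 1, \dots, K,\ h_k = 1, \dots, m_k
\end{array}
\qquad (3)
$$

Classical inferential methods for the model parameters of (1) are encumbered by the intractability
of the normalizing constant, which makes the numerical optimization of the likelihood function very
challenging. The next section describes an MCMC algorithm to simulate from the Bayesian posterior
distribution of the parameters.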
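
Membership in the constrained sample spaces in (3) amounts to checking a few linear constraints. The sketch below is illustrative only: the function names are hypothetical, each indicator matrix is stored with one row per node, and an undirected edge is counted once (an assumption about how $d$ is measured).

```python
def is_one_hot(y):
    """Each node (row) takes exactly one level of the categorical variable."""
    return all(all(v in (0, 1) for v in row) and sum(row) == 1 for row in y)

def is_simple_graph(z):
    """Symmetric 0/1 adjacency matrix without self-loops."""
    n = len(z)
    return all(z[r][r] == 0 for r in range(n)) and all(
        z[r][s] in (0, 1) and z[r][s] == z[s][r]
        for r in range(n) for s in range(n)
    )

def degree_sequence(z):
    return [sum(row) for row in z]

def in_fixed_edge_space(z, ys, d):
    """Fixed number of edges d, with one-hot categorical indicators."""
    n = len(z)
    n_edges = sum(z[r][s] for r in range(n) for s in range(r + 1, n))
    return is_simple_graph(z) and n_edges == d and all(is_one_hot(y) for y in ys)

def in_fixed_degree_space(z, ys, degrees):
    """Fixed degree sequence, with one-hot categorical indicators."""
    return (is_simple_graph(z) and degree_sequence(z) == list(degrees)
            and all(is_one_hot(y) for y in ys))

# Toy configuration: a path on three nodes and one two-level property.
z = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
ys = [[[1, 0], [0, 1], [1, 0]]]
print(in_fixed_edge_space(z, ys, 2), in_fixed_degree_space(z, ys, [1, 2, 1]))
```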
4 Estimation method


As noted by Murray et al. (2006) and by Caimo and Friel (2011), the intractability of the normalizing
constants of most random network models entails a "double intractability" of the posterior distribution
when the model is embedded in a Bayesian framework. This is also true for model (1). MCMC
algorithms are often used to draw samples from distributions with intractable normalization constants.
However, standard MCMC algorithms do not apply directly to a doubly intractable distribution.
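
To give a sense of scale, the following sketch counts the configurations in the sample space used in Section 5 (54 nodes, 234 edges, two properties with 2 and 9 levels), over which the normalizing constant would have to be summed. It is a back-of-the-envelope count under the assumption that every labelling and every graph with exactly 234 edges is feasible.

```python
import math

n, d = 54, 234                 # nodes and fixed number of edges (Section 5)
levels = (2, 9)                # genders and nationalities

n_pairs = n * (n - 1) // 2                       # unordered pairs of nodes
n_graphs = math.comb(n_pairs, d)                 # graphs with exactly d edges
n_labelings = math.prod(m ** n for m in levels)  # assignments of levels to nodes
log10_size = math.log10(n_graphs) + math.log10(n_labelings)
print(f"about 10^{log10_size:.0f} configurations")
```

The count has several hundred digits, which is why direct enumeration is not an option even for a single component.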

Consider the kernel of the probability function (1) and let $x^{(0)} \in \mathcal{X}$ be the observed data set (the
co-authorship network structure, the nodal genders and the nodal nationalities). Given a
prior distribution $\pi(\alpha, \beta, \gamma)$, Bayes' rule gives:

$$
P(\alpha, \beta, \gamma \mid x^{(0)}) =
\frac{P(x^{(0)} \mid \alpha, \beta, \gamma)\, \pi(\alpha, \beta, \gamma)}
     {\int P(x^{(0)} \mid \alpha, \beta, \gamma)\, \pi(\alpha, \beta, \gamma)\, d\alpha\, d\beta\, d\gamma}
$$

Since both $P(x^{(0)} \mid \alpha, \beta, \gamma)$ and $P(\alpha, \beta, \gamma \mid x^{(0)})$ can only be specified up to proportionality,
Murray et al. (2006) proposed an MCMC approach which largely overcomes this drawback,
based on the simulation of the joint distribution of the parameters and the sample
space, conditioned on the observed data set $x^{(0)}$, that is to say, $P(x, \alpha, \beta, \gamma \mid x^{(0)})$. We follow the
same approach. Our application of the Metropolis-Hastings method (Bolstad 2009) to simulate from
this distribution is summarized in Algorithm 1.


Algorithm 1 Exchange algorithm of Murray et al. (2006).


Initialize $(\alpha, \beta, \gamma)$
repeat

    Draw $(\alpha', \beta', \gamma')$ from $h(\cdot \mid \alpha, \beta, \gamma)$

    Draw $x'$ from $P(\cdot \mid \alpha', \beta', \gamma')$

    Accept $(\alpha', \beta', \gamma')$ with probability
    $\min\left\{1,\ \dfrac{P(x' \mid \alpha, \beta, \gamma)\, P(x^{(0)} \mid \alpha', \beta', \gamma')\, \pi(\alpha', \beta', \gamma')}{P(x^{(0)} \mid \alpha, \beta, \gamma)\, P(x' \mid \alpha', \beta', \gamma')\, \pi(\alpha, \beta, \gamma)}\right\}$

    Update $(\alpha, \beta, \gamma)$
until convergence
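
The reason the acceptance step can be evaluated despite the intractable constants can be made explicit. As shorthand introduced here (not the paper's notation), write $\theta = (\alpha, \beta, \gamma)$ and $P(x \mid \theta) = q(x \mid \theta)/Z(\theta)$, where $q(x \mid \theta) = \exp(T(x)^\top \theta)$ is the kernel of (1) with stacked sufficient statistics $T(x)$ and $Z(\theta)$ is the normalizing constant. For feasible $x'$ and $x^{(0)}$, the ratio in the acceptance step becomes:

```latex
\frac{P(x' \mid \theta)\, P(x^{(0)} \mid \theta')\, \pi(\theta')}
     {P(x^{(0)} \mid \theta)\, P(x' \mid \theta')\, \pi(\theta)}
= \frac{q(x' \mid \theta)\, q(x^{(0)} \mid \theta')\, \pi(\theta')}
       {q(x^{(0)} \mid \theta)\, q(x' \mid \theta')\, \pi(\theta)}
  \cdot \frac{Z(\theta)\, Z(\theta')}{Z(\theta)\, Z(\theta')}
= \exp\!\Big[ (\theta' - \theta)^{\top} \big( T(x^{(0)}) - T(x') \big) \Big]\,
  \frac{\pi(\theta')}{\pi(\theta)}
```

Each normalizing constant appears once in the numerator and once in the denominator, so only sufficient statistics and prior densities need to be computed.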


The proposal distribution $h(\cdot)$ is used to generate candidate parameter values and is assumed here
to be symmetric. Note that in step 3 of Algorithm 1 a new value of the parameters $(\alpha', \beta', \gamma')$ is
randomly proposed and in step 4 a sample from $\mathcal{X}$ is simulated with the probability given in (1). Clearly,
this is a computationally intensive procedure.
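
The steps of Algorithm 1 can be sketched on a deliberately simplified toy model. The assumptions are loud ones: a single parameter $\gamma$, node labels held fixed, no constraint on the number of edges, and a zero-mean Gaussian prior. With these simplifications edges are independent, so the auxiliary draw $x'$ can be sampled exactly; for model (1) on a constrained sample space this draw would itself require an inner MCMC run. All function and parameter names are hypothetical.

```python
import math
import random

def similarity_stat(edges, labels):
    """Number of edges whose endpoints share the same label."""
    return sum(1 for r, s in edges if labels[r] == labels[s])

def sample_graph(gamma, labels, rng):
    """Exact draw from the toy model P(z | gamma) ∝ exp(gamma * similarity_stat)."""
    n = len(labels)
    p_same = 1.0 / (1.0 + math.exp(-gamma))   # same-label pairs have odds e^gamma
    edges = []
    for r in range(n):
        for s in range(r + 1, n):
            p = p_same if labels[r] == labels[s] else 0.5
            if rng.random() < p:
                edges.append((r, s))
    return edges

def exchange_sampler(observed_edges, labels, n_iter=3000, step=0.5,
                     prior_sd=10.0, seed=0):
    rng = random.Random(seed)
    t_obs = similarity_stat(observed_edges, labels)
    gamma, chain = 0.0, []
    for _ in range(n_iter):
        gamma_new = gamma + rng.gauss(0.0, step)      # symmetric proposal h
        aux = sample_graph(gamma_new, labels, rng)    # auxiliary x' ~ P(. | gamma')
        t_aux = similarity_stat(aux, labels)
        # Log acceptance ratio: normalizing constants cancel, leaving a
        # difference of sufficient statistics plus the log prior ratio.
        log_prior_ratio = (gamma ** 2 - gamma_new ** 2) / (2 * prior_sd ** 2)
        log_accept = (gamma_new - gamma) * (t_obs - t_aux) + log_prior_ratio
        if rng.random() < math.exp(min(0.0, log_accept)):
            gamma = gamma_new
        chain.append(gamma)
    return chain

# Toy data: two groups of three nodes, densely connected within groups.
labels = [0, 0, 0, 1, 1, 1]
observed = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (0, 3)]
chain = exchange_sampler(observed, labels)
posterior_mean = sum(chain[1000:]) / len(chain[1000:])
```

Because every within-group edge is present in the toy data, the sampled values of $\gamma$ concentrate on positive values, which is the direction one would expect for assortative data.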

5 Numerical results and analysis 

In this section the described probabilistic model is applied to the second largest component of the
co-authorship network in the neuroscience community, as shown in Figures 1 and 2, where $n = 54$
(nodes), $K = 2$ (categorical properties), $m_1 = 2$ (genders), $m_2 = 9$ (nationalities). The sample space
is defined as the Cartesian product between the set of $n$-node undirected networks with a fixed number
of edges $d = 234$ and the set of all possible realizations of 2 categorical variables with 2 and 9 levels.
The following specification of model (1) is considered in this section:


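$$
P(x) \propto
\begin{cases}
\exp\left[ \sum_{k \in \mathcal{K}} \sum_{h_k=1}^{m_k} \alpha_{k,h_k} \left( \sum_{r \in \mathcal{V}} y_{h_k r} \right)
 + \sum_{k \in \mathcal{K}} \gamma_k \sum_{(r,s) \in \mathcal{V} \times \mathcal{V}} z_{rs} \left( \sum_{h_k=1}^{m_k} y_{h_k r}\, y_{h_k s} \right) \right] & \text{if } x \in \mathcal{X} \\
0 & \text{otherwise}
\end{cases}
\qquad (2)
$$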

The vector of sufficient statistics thus contains the total number of nodes for each property and
each level, and the association between edges and the two nodal properties. The corresponding natural
parameters are $\alpha_{k,h_k}$ and $\gamma_k$.
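
The sufficient statistics of (2) are simple to compute from an edge list and one label vector per property. The sketch below is a minimal illustration with hypothetical names and toy values; it counts each undirected edge once, which differs from summing over ordered pairs only by a factor of two absorbed into $\gamma_k$, and it assumes the configuration is feasible.

```python
def level_counts(labels, n_levels):
    """Number of nodes at each level of one categorical property."""
    counts = [0] * n_levels
    for lab in labels:
        counts[lab] += 1
    return counts

def similarity_stat(edges, labels):
    """Number of edges whose two endpoints share the same level."""
    return sum(1 for r, s in edges if labels[r] == labels[s])

def log_kernel(edges, labels_by_prop, alpha, gamma):
    """Unnormalized log-probability of model (2) for a feasible configuration."""
    total = 0.0
    for k, labels in enumerate(labels_by_prop):
        counts = level_counts(labels, len(alpha[k]))
        total += sum(a * c for a, c in zip(alpha[k], counts))
        total += gamma[k] * similarity_stat(edges, labels)
    return total

# Toy configuration with made-up labels and parameter values.
edges = [(0, 1), (1, 2), (2, 3)]
gender = [0, 0, 1, 1]            # two levels
nationality = [0, 1, 1, 2]       # three levels
alpha = [[0.0, 0.0], [0.0, 0.0, 0.0]]
gamma = [0.03, 0.45]
value = log_kernel(edges, [gender, nationality], alpha, gamma)
print(round(value, 2))           # 0.51
```

Here two edges join nodes of the same gender and one edge joins nodes of the same nationality, so the kernel equals 0.03 * 2 + 0.45 * 1.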

It is important to keep in mind which interpretation should be given to the estimated natural
parameters $\alpha_{k,h_k}$ and $\gamma_k$, for $h_k = 1, \dots, m_k$, $k \in \mathcal{K}$. In the case of a uniform distribution within the
sample space $\mathcal{X}$, we should have $\alpha_{k,h_k} = 0$; hence deviations $\alpha_{k,h_k} \neq 0$ can be interpreted as
incorporating different proportions of the properties within the nodal population (e.g., nationalities
not being evenly present in the sample). Moreover, if nodal and network properties are independent,
then $\gamma_k = 0$: the association between connections and nodal similarities has no effect on the probability
of observing a given configuration in $\mathcal{X}$. Any non-zero value of the natural parameters entails a
deviation from such independence. Specifically, if $\gamma_k = 0$, (2) reduces to the product between
the probability mass functions of $K$ multinomial random variables $\mathrm{Multinom}(e^{\alpha_{k,1}}, \dots, e^{\alpha_{k,m_k}})$ and the
Erdős-Rényi random graph model with fixed number of edges $d$. In the numerical results we set
$\gamma_k = 0$ as a null model for the computational comparison.
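
The factorization of the null model can be checked directly from (2). As a short derivation under the sample space of this section (each node takes exactly one level per property, and the number of edges is fixed at $d$), setting $\gamma_k = 0$ gives:

```latex
P(x) \;\propto\;
\prod_{k \in \mathcal{K}} \prod_{r \in \mathcal{V}}
  \exp\Big( \sum_{h_k=1}^{m_k} \alpha_{k,h_k}\, y_{h_k r} \Big)
\cdot \mathbb{1}\Big[ \textstyle\sum_{(r,s)} z_{rs} = d \Big],
\qquad
P(y_{h_k r} = 1) \;=\; \frac{e^{\alpha_{k,h_k}}}{\sum_{h=1}^{m_k} e^{\alpha_{k,h}}}
```

The first factor makes the level of each node an independent categorical draw with probabilities proportional to $e^{\alpha_{k,h_k}}$, and the indicator makes $z$ uniform over graphs with $d$ edges. Note that the $\alpha_{k,h_k}$ are then identified only up to an additive constant within each property $k$.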

The contour plot in Figure 4 shows the estimated marginal posterior of $(\gamma_1, \gamma_2)$, corresponding to
the gender and the nationality effect, for the second largest component of the co-authorship data set
in Section 2. These results have been obtained using 3 chains with 30,000 MCMC iterations.




(a) Contour plot of the empirical posterior (b) Bidimensional empirical posterior


Figure 4: Marginal posterior of $(\gamma_1, \gamma_2)$, corresponding to the second largest component of the co-authorship data set.


Figure 4 reports a positive expected effect of nationality (0.45) and a negligible expected effect
of gender (0.03). The variability of the gender effect is much larger, suggesting a lack
of information about the parameter $\gamma_1$.

After estimating the model parameters $(\alpha, \gamma)$, a sample of 10,000 elements from $\mathcal{X}$ was
simulated and the nodal similarities were computed, both for the estimated model and for the null
model $(\alpha, 0)$. Figure 5 shows the observed and expected nationality similarities for each of the
$54 \times 53/2$ pairs of authors. The null model is clearly unable to capture
the pairwise similarities, whereas the expected values of $\sum_{h_k=1}^{m_k} y_{h_k r}\, y_{h_k s}$ under the estimated model
in (2) closely resemble the observed similarities.



(a) Model (2) (b) Null model (c) Observed data set


Figure 5: The values of the nationality similarities $\sum_{h_k=1}^{m_k} y_{h_k r}\, y_{h_k s}$ for the $54 \times 53/2$ pairs of nodes.

Tables 2 and 3 report the expected proportions of edges associated with different combinations of
genders and nationalities, respectively. The estimated model seems to capture effectively
the assortative mixing of the data, i.e., the association between nodal properties and connections,
along with the total count of each individual category and the network collaboration density. This
was indeed our initial intention when a joint model for authors' characteristics and collaboration
patterns was introduced in Section 3.



          male           female
male      0.40 (0.48)    0.46 (0.40)
female                   0.14 (0.12)
total     0.86 (0.88)    0.60 (0.52)


Table 2: Expected proportions of edges for each combination of genders. The observed proportions are reported
in parentheses.

























           Italy          USA            Spain          Sweden         S. Africa      Japan          Serbia         Russia         UK
Italy      0.004 (0.004)  0.004 (0.000)  0.014 (0.017)  0.001 (0.025)  0.000 (0.008)  0.007 (0.000)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)
USA                       0.260 (0.235)  0.120 (0.031)  0.012 (0.000)  0.006 (0.000)  0.060 (0.004)  0.005 (0.000)  0.000 (0.000)  0.005 (0.023)
Spain                                    0.267 (0.329)  0.020 (0.024)  0.010 (0.009)  0.111 (0.064)  0.010 (0.201)  0.000 (0.000)  0.010 (0.000)
Sweden                                                  0.001 (0.017)  0.001 (0.006)  0.010 (0.000)  0.000 (0.001)  0.000 (0.008)  0.000 (0.000)
S. Africa                                                              0.001 (0.006)  0.002 (0.000)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)
Japan                                                                                 0.085 (0.140)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)
Serbia                                                                                               0.000 (0.000)  0.000 (0.000)  0.000 (0.000)
Russia                                                                                                              0.000 (0.000)  0.000 (0.000)
UK                                                                                                                                 0.000 (0.000)


Table 3: Expected proportions of edges for each combination of nationalities. The observed proportions are reported
in parentheses.
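
The observed proportions in Tables 2 and 3 are shares of edges by the unordered pair of levels at their endpoints. A minimal sketch of that computation follows; the function name and the toy data are hypothetical, and a non-empty edge list is assumed. The expected proportions under a fitted model would be obtained by averaging the same quantity over simulated configurations.

```python
from collections import Counter

def edge_mixing_proportions(edges, labels):
    """Share of edges joining each unordered pair of levels."""
    counts = Counter(tuple(sorted((labels[r], labels[s]))) for r, s in edges)
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

# Hypothetical toy data, not the paper's network.
labels = {0: "male", 1: "male", 2: "female", 3: "female"}
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
props = edge_mixing_proportions(edges, labels)
print(props)
```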


Figure 6 provides a graphical comparison between the expected (dark-grey) and the observed (light-grey)
degree sequences.


Figure 6: In light-grey the observed degree sequence. In dark-grey the expected degree sequence.

To summarize, for the second largest component of the co-authorship data set, the estimated
results indicate a negligible effect of gender similarity on the authors' connections, along with a positive
effect of their nationalities. The estimated model seems to fit the empirical observations properly, as
suggested by Tables 2 and 3. The expected degree sequence also resembles the observed one, suggesting
that the model is able to capture both individual and structural properties of the co-authorship data.

6 Conclusion 

This paper presented an exponential random model for authors' characteristics and collaboration
patterns in bibliometric networks, which allows the analysis of a multivariate data set to be combined with
that of the assortative pattern of nodal similarities in networks. We proposed a Bayesian estimation
framework and a specialized MCMC algorithm to simulate from a "doubly intractable" posterior
distribution. We showed a strong capability of the model to account for relevant network features
(the degree sequence) based on the observed nodal properties, providing a deep understanding of the
linkage between individual and social properties and a substantial insight into the level of homophily
in co-authorship networks.

Our results suggest several lines of work for future research. We could compare the model fit for
different specifications of the sample space $\mathcal{X}$. We could also study the inclusion of further nodal
properties, such as age, principal keywords or number of received citations.